{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DweYe9FcbMK_"
      },
      "source": [
        "##### Copyright 2018 The TensorFlow Authors.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "AVV2e0XKbJeX"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "sUtoed20cRJJ"
      },
      "source": [
        "# Load text"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1ap_W4aQcgNT"
      },
      "source": [
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/load_data/text\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/load_data/text.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NWeQAo0Ec_BL"
      },
      "source": [
        "This tutorial provides an example of how to use `tf.data.TextLineDataset` to load examples from text files. `TextLineDataset` is designed to create a dataset from a text file, in which each example is a line of text from the original file. This is potentially useful for any text data that is primarily line-based (for example, poetry or error logs).\n",
        "\n",
        "In this tutorial, we'll use three different English translations of the same work, Homer's Illiad, and train a model to identify the translator given a single line of text."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fgZ9gjmPfSnK"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "baYFZMW_bJHh"
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import, division, print_function, unicode_literals\n",
        "\n",
        "try:\n",
        "  # %tensorflow_version only exists in Colab.\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "  pass\n",
        "import tensorflow as tf\n",
        "\n",
        "import tensorflow_datasets as tfds\n",
        "import os"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YWVWjyIkffau"
      },
      "source": [
        "The texts of the three translations are by:\n",
        "\n",
        " - [William Cowper](https://en.wikipedia.org/wiki/William_Cowper) — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt)\n",
        "\n",
        " - [Edward, Earl of Derby](https://en.wikipedia.org/wiki/Edward_Smith-Stanley,_14th_Earl_of_Derby) — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt)\n",
        "\n",
        "- [Samuel Butler](https://en.wikipedia.org/wiki/Samuel_Butler_%28novelist%29) — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt)\n",
        "\n",
        "The text files used in this tutorial have undergone some typical preprocessing tasks, mostly removing stuff — document header and footer, line numbers, chapter titles. Download these lightly munged files locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "4YlKQthEYlFw"
      },
      "outputs": [],
      "source": [
        "DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'\n",
        "FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']\n",
        "\n",
        "for name in FILE_NAMES:\n",
        "  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)\n",
        "  \n",
        "parent_dir = os.path.dirname(text_dir)\n",
        "\n",
        "parent_dir"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "q3sDy6nuXoNp"
      },
      "source": [
        "## Load text into datasets\n",
        "\n",
        "Iterate through the files, loading each one into its own dataset.\n",
        "\n",
        "Each example needs to be individually labeled, so use `tf.data.Dataset.map` to apply a labeler function to each one. This will iterate over every example in the dataset, returning (`example, label`) pairs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "K0BjCOpOh7Ch"
      },
      "outputs": [],
      "source": [
        "def labeler(example, index):\n",
        "  return example, tf.cast(index, tf.int64)  \n",
        "\n",
        "labeled_data_sets = []\n",
        "\n",
        "for i, file_name in enumerate(FILE_NAMES):\n",
        "  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))\n",
        "  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))\n",
        "  labeled_data_sets.append(labeled_dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "M8PHK5J_cXE5"
      },
      "source": [
        "Combine these labeled datasets into a single dataset, and shuffle it.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "6jAeYkTIi9-2"
      },
      "outputs": [],
      "source": [
        "BUFFER_SIZE = 50000\n",
        "BATCH_SIZE = 64\n",
        "TAKE_SIZE = 5000"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Qd544E-Sh63L"
      },
      "outputs": [],
      "source": [
        "all_labeled_data = labeled_data_sets[0]\n",
        "for labeled_dataset in labeled_data_sets[1:]:\n",
        "  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)\n",
        "  \n",
        "all_labeled_data = all_labeled_data.shuffle(\n",
        "    BUFFER_SIZE, reshuffle_each_iteration=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "r4JEHrJXeG5k"
      },
      "source": [
        "You can use `tf.data.Dataset.take` and `print` to see what the `(example, label)` pairs look like. The `numpy` property shows each Tensor's value."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "gywKlN0xh6u5"
      },
      "outputs": [],
      "source": [
        "for ex in all_labeled_data.take(5):\n",
        "  print(ex)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5rrpU2_sfDh0"
      },
      "source": [
        "## Encode text lines as numbers\n",
        "\n",
        "Machine learning models work on numbers, not words, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.\n",
        "\n",
        "### Build vocabulary\n",
        "\n",
        "First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are a few ways to do this in both TensorFlow and Python. For this tutorial:\n",
        "\n",
        "1. Iterate over each example's `numpy` value.\n",
        "2. Use `tfds.features.text.Tokenizer` to split it into tokens.\n",
        "3. Collect these tokens into a Python set, to remove duplicates.\n",
        "4. Get the size of the vocabulary for later use."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "YkHtbGnDh6mg"
      },
      "outputs": [],
      "source": [
        "tokenizer = tfds.features.text.Tokenizer()\n",
        "\n",
        "vocabulary_set = set()\n",
        "for text_tensor, _ in all_labeled_data:\n",
        "  some_tokens = tokenizer.tokenize(text_tensor.numpy())\n",
        "  vocabulary_set.update(some_tokens)\n",
        "\n",
        "vocab_size = len(vocabulary_set)\n",
        "vocab_size"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "0W35VJqAh9zs"
      },
      "source": [
        "### Encode examples\n",
        "\n",
        "Create an encoder by passing the `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes in a string of text and returns a list of integers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "gkxJIVAth6j0"
      },
      "outputs": [],
      "source": [
        "encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "v6S5Qyabi-vo"
      },
      "source": [
        "You can try this on a single line to see what the output looks like."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "jgxPZaxUuTbk"
      },
      "outputs": [],
      "source": [
        "example_text = next(iter(all_labeled_data))[0].numpy()\n",
        "print(example_text)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "XoVpKR3qj5yb"
      },
      "outputs": [],
      "source": [
        "encoded_example = encoder.encode(example_text)\n",
        "print(encoded_example)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "p9qHM0v8k_Mg"
      },
      "source": [
        "Now run the encoder on the dataset by wrapping it in `tf.py_function` and  passing that to the dataset's `map` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "HcIQ7LOTh6eT"
      },
      "outputs": [],
      "source": [
        "def encode(text_tensor, label):\n",
        "  encoded_text = encoder.encode(text_tensor.numpy())\n",
        "  return encoded_text, label\n",
        "\n",
        "def encode_map_fn(text, label):\n",
        "  return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))\n",
        "\n",
        "all_encoded_data = all_labeled_data.map(encode_map_fn)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_YZToSXSm0qr"
      },
      "source": [
        "## Split the dataset into test and train batches\n",
        "\n",
        "Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.\n",
        "\n",
        "Before being passed into the model, the datasets need to be batched. Typically, the examples inside of a batch need to be the same size and shape. But, the examples in these datasets are not all the same size — each line of text had a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "r-rmbijQh6bf"
      },
      "outputs": [],
      "source": [
        "train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)\n",
        "train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))\n",
        "\n",
        "test_data = all_encoded_data.take(TAKE_SIZE)\n",
        "test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Xdz7SVwmqi1l"
      },
      "source": [
        "Now, `test_data` and `train_data` are not collections of (`example, label`) pairs, but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.\n",
        "\n",
        "To illustrate:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "kMslWfuwoqpB"
      },
      "outputs": [],
      "source": [
        "sample_text, sample_labels = next(iter(test_data))\n",
        "\n",
        "sample_text[0], sample_labels[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "UI4I6_Sa0vWu"
      },
      "source": [
        "Since we have introduced a new token encoding (the zero used for padding), the vocabulary size has increased by one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "IlD1Lli91vuc"
      },
      "outputs": [],
      "source": [
        "vocab_size += 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "K8SUhGFNsmRi"
      },
      "source": [
        "## Build the model\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "QJgI1pow2YR9"
      },
      "outputs": [],
      "source": [
        "model = tf.keras.Sequential()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "wi0iiKLTKdoF"
      },
      "source": [
        "The first layer converts integer representations to dense vector embeddings. See the [Word Embeddings](../../tutorials/sequences/word_embeddings) tutorial for more details. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "DR6-ctbY638P"
      },
      "outputs": [],
      "source": [
        "model.add(tf.keras.layers.Embedding(vocab_size, 64))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_8OJOPohKh1q"
      },
      "source": [
        "The next layer is a [Long Short-Term Memory](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) layer, which lets the model understand words in their context with other words. A bidirectional wrapper on the LSTM helps it to learn about the datapoints in relationship to the datapoints that came before it and after it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "x6rnq6DN_WUs"
      },
      "outputs": [],
      "source": [
        "model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cdffbMr5LF1g"
      },
      "source": [
        "Finally we'll have a series of one or more densely connected layers, with the last one being the output layer. The output layer produces a probability for all the labels. The one with the highest probability is the models prediction of an example's label."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "QTEaNSnLCsv5"
      },
      "outputs": [],
      "source": [
        "# One or more dense layers.\n",
        "# Edit the list in the `for` line to experiment with layer sizes.\n",
        "for units in [64, 64]:\n",
        "  model.add(tf.keras.layers.Dense(units, activation='relu'))\n",
        "\n",
        "# Output layer. The first argument is the number of labels.\n",
        "model.add(tf.keras.layers.Dense(3, activation='softmax'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zLHPU8q5DLi_"
      },
      "source": [
        "Finally, compile the model. For a softmax categorization model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "pkTBUVO4h6Y5"
      },
      "outputs": [],
      "source": [
        "model.compile(optimizer='adam',\n",
        "              loss='sparse_categorical_crossentropy',\n",
        "              metrics=['accuracy'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DM-HLo5NDhql"
      },
      "source": [
        "## Train the model\n",
        "\n",
        "This model running on this data produces decent results (about 83%)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "aLtO33tNh6V8"
      },
      "outputs": [],
      "source": [
        "model.fit(train_data, epochs=3, validation_data=test_data)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "KTPCYf_Jh6TH"
      },
      "outputs": [],
      "source": [
        "eval_loss, eval_acc = model.evaluate(test_data)\n",
        "\n",
        "print('\\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Q7IWg9UBNUNM"
      },
      "outputs": [],
      "source": [
        ""
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "text.ipynb",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true,
      "version": "0.3.2"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
